library(fpp3)03_C_TSGraphs_Lagplots_Autocorrelation
References
This is not original material but a compilation of the following sources with some additional examples and clarifications introduced by the professor:
- Hyndman, R.J., & Athanasopoulos, G., 2021. Forecasting: principles and practice. 3rd edition.
- Additional material provided by the book authors
Lagged variables and time-plots
Given a time series we can compute its lags by shifting it in time a given number of steps. The lagged variable simply contains past information of the variable, which has been shifted as many steps as indicated by the number of the lag.
As with most things in life, this is better understood with an example:
recent_beer <- aus_production %>%
filter(year(Quarter) >= 2000) %>%
select(Quarter, Beer)
recent_beer# A tsibble: 42 x 2 [1Q]
Quarter Beer
<qtr> <dbl>
1 2000 Q1 421
2 2000 Q2 402
3 2000 Q3 414
4 2000 Q4 500
5 2001 Q1 451
6 2001 Q2 380
7 2001 Q3 416
8 2001 Q4 492
9 2002 Q1 428
10 2002 Q2 408
# ℹ 32 more rows
With this code, we generate the lagged values of the time series corresponding to the variable Beer:
for (i in seq(1, 4)) {
lag_name = paste0("Beer_lag", as.character(i))
recent_beer[[lag_name]] = lag(recent_beer[["Beer"]], i)
}
recent_beer# A tsibble: 42 x 6 [1Q]
Quarter Beer Beer_lag1 Beer_lag2 Beer_lag3 Beer_lag4
<qtr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 Q1 421 NA NA NA NA
2 2000 Q2 402 421 NA NA NA
3 2000 Q3 414 402 421 NA NA
4 2000 Q4 500 414 402 421 NA
5 2001 Q1 451 500 414 402 421
6 2001 Q2 380 451 500 414 402
7 2001 Q3 416 380 451 500 414
8 2001 Q4 492 416 380 451 500
9 2002 Q1 428 492 416 380 451
10 2002 Q2 408 428 492 416 380
# ℹ 32 more rows
If we call beer \(y_t\) (the time series we are studying), the formal notation for the variables above would be:
- \(y_t \rightarrow\)
Beerproduction, the variable we are studying - \(y_{t-1} \rightarrow\)
Beer_lag1- first lag of the variable beer - \(y_{t-2} \rightarrow\)
Beer_lag2- second lag of the variable beer - \(y_{t-3} \rightarrow\)
Beer_lag3- third lag of the variable beer - \(y_{t-4} \rightarrow\)
Beer_lag4- fourth lag of the variable beer
It is important to remark that:
- Because there is no information to fill the void generated when shifting the variable, these voids are filled with NAs.
- Note that the lagged variables are time series, but shorter than the original variable due to these NAs.
- The rest of the values of the time series temain the same
This results in the following structure of NAs as the lag number increases:
Let us now create a timeplot of these lags. If we focus for example on Beer (\(y_t\)) and Beer_lag4 (\(y_{t-4}\)), we observe that each value of the variable Beer is confronted to its value 4 time steps ago. For example the value of beer at 2001 Q1 is next to its value at 2000 Q1 (exactly four time-steps ago)
recent_beer %>% select(Beer, Beer_lag4)# A tsibble: 42 x 3 [1Q]
Beer Beer_lag4 Quarter
<dbl> <dbl> <qtr>
1 421 NA 2000 Q1
2 402 NA 2000 Q2
3 414 NA 2000 Q3
4 500 NA 2000 Q4
5 451 421 2001 Q1
6 380 402 2001 Q2
7 416 414 2001 Q3
8 492 500 2001 Q4
9 428 451 2002 Q1
10 408 380 2002 Q2
# ℹ 32 more rows
The code below creates a time-plot of both variables (change n_lag if you wish to depict another lag):
n_lag = 4
lag_name = paste0("Beer_lag", n_lag)
recent_beer %>%
autoplot() +
scale_x_yearquarter(breaks = "1 years",
minor_breaks = "1 year") +
geom_line(aes_string(x = "Quarter", y = lag_name), # Call attention upon aes_string
color = "red",
linetype = "dashed")Plot variable not specified, automatically selected `.vars = Beer`
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
Warning: Removed 4 rows containing missing values (`geom_line()`).
We can see that, for the 4-th lag, the lagged variable and the original variable are perfectly in phase. This is because this series has a seasonal period of 1 year (4 quarters) and simple seasonality (meaning only this yearly seasonality is present).
Lagged-variables and scatterplots. Autocorrelation.
Lagged variables enable us to study the extent to which the value of the series at time t (the original series) depends upon the values of the series at past time steps.
Let us compute a scatterplot depicting the relation between the fourth lag of the variable beer and the original variable. That is, we are going to forget about time and are simply going to confront values of the series Beer and its fourth lag Beer_lag4. These are the values signalled in yellow in the tsibble below (note that NAs have of course been excluded):
The code to attain that is:
recent_beer %>%
gg_lag(y = Beer, geom = "point", lags = 4)We could summarize the degree of linear relationship between these two variables using, for example the pearsons correlation coefficient. However, we will see there is a better metric in the case of time series
Let us compute these scatterplots for a number of lags. This is what is calles a lagplot of the time series:
recent_beer %>%
gg_lag(y = Beer, geom = "point", lags = 1:12)In the figure above, you see that the relationship between the variable and the lagged variables at multiples of the seasonal period (m = 4, 8, 12…) is much stronger.
Autocorrelation vs. Pearsons correlation coefficient
Let us recall the formula for the Pearson’s correlation coefficient between two variables x and y
If we applied this formula to each of the lagplots in the previous section, the number of terms in the sums in the denominator would decrease as the lag number increases. This would be due to the fact that the lagged time series has less points than the original series.
We can use this fact to create a new correlation coefficient that considers the fact that, the further away in the past a lag is, the smaller its influence over the original time series should be. This is what we call the autocorrelation coefficient between the time series \(y_t\) and its k-th lag \(y_{t-k}\):
\[\begin{align*} r_{k} = \frac{\sum\limits_{t=k+1}^T (y_{t}-\bar{y})(y_{t-k}-\bar{y})} {\sum\limits_{t=1}^T (y_{t}-\bar{y})^2} \end{align*}\]In the above formula:
- \(t = 1\) is the first point in the time series
- \(t = T\) is the last poitn for which we have recorded data
- The denominator remains the same for the correlation coefficient of every lag. The sum in the denominator extends over the totality of the original time series (from \(t=1\) to \(t=T\)).
- The numerator has a decreasing number of terms as \(k\) (the lag-number) increases. The sum in the numerator extends from \(k+1\) to \(T\).
- Example: if \(k = 1\) (first lag), the first value of the lagges seris (\(t=1\)) is an
NA. The sum extends therefore from \(t=k+1=2\) until the end
This simple fact inherently decreases the autocorrelation coefficient of lags that are further away in the past because the denominator remains constant while the numerator decreases with increasing k.
IMPORTANT: this does not mean that autocorrelation coefficients in the past cannot be high. It means that the autocorrelation coefficient has this built-in feature that scales the coefficient depending on the point in time it is referring to.
Computing the autocorrelation coefficients in R
The autocorrelation coefficients may be easily computed using the function ACF()
recent_beer %>%
ACF()Response variable not specified, automatically selected `var = Beer`
# A tsibble: 16 x 2 [1Q]
lag acf
<cf_lag> <dbl>
1 1Q -0.0530
2 2Q -0.758
3 3Q -0.0262
4 4Q 0.802
5 5Q -0.0775
6 6Q -0.657
7 7Q 0.00119
8 8Q 0.707
9 9Q -0.0888
10 10Q -0.588
11 11Q 0.0241
12 12Q 0.632
13 13Q -0.0660
14 14Q -0.505
15 15Q -0.00891
16 16Q 0.498
ACF Plot (a.k.a Correlogram)
If we now compute the autocorrelation coefficient corresponding to each lag, we can create a graph with the following axes:
- x-axis: the lag number.
- y-axis: the value of the corresponding autocorrelation coefficient.
This is what we call the lagplot. The lagplot can be easily depicted as follows:
recent_beer %>%
ACF() %>%
autoplot()Response variable not specified, automatically selected `var = Beer`
Number of lags to be depicted
With the argument lag_max we can specify the number of lags to be computed and subsequently depicted in the correlogram. This is important because, for seasonality to appear in the correlogram, a sufficient number of lags needs to be computed. See the upcoming sections for further details.
recent_beer %>%
ACF(lag_max = 24) %>%
autoplot()Response variable not specified, automatically selected `var = Beer`
Tend and seasonaity in ACF plots
When data have a trend:
- The autocolleration coefficients for small lags (please note the small lags appreciation) tend to be large and positive because observations nearby in time are also nearby in value. And so the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase. - Please focus only on the region circled in red and not on the region where the ACF becomes negative in this case. What is important to recognize the trend is this slowly decreasing ACFs that are positive at the beginning.
- Note that the pattern is the same for a negative trend:
When data are seasonal:
- The autocorrelation will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags..
- NOTE: the autocorrelation at higher multiples of the seasonal period will be smaller than at lower multiples of the seasonal period because the former are further away in the past and the formula for the autocorrelaiton coefficient accounts for this fact, as we previously explained.
When data are both trended and seasonal
- You see a combination of these effects. Depending on the relative strength of the trend and teh seasonality, the pattern varies. Below a description of some of the most common cases. These have been generated using this shiny app:
Web-app - Time Series autocorrelation - basic patterns
- Seasonal and trend data are of the same order (one does not dominate over the other):
Let us look at the example of the sales of Australian antidiabetic drugs:
- Seasonal component weaker than trend component: waves due to seasonal component can be seen on the ACF… but the ACF alone does not reveal the actual seasonal period (lag 12 not higher than the rest). This does not mean that there is no seasonality, only that it is not visible on the ACF. With time series decomposition (next session), we will be able to extract the seasonal component.
- Seasonal component much weaker than trend component: ACF appears to be that of a trend. However this does not mean that there is no seasonality, only that the ACF does not reflect it. With time series decomposition (next session), we will be able to extract the seasonal component.
Another relevant case to consider is when the trend component is much weaker than the seasonal component. Then the trend would barely be noticeable on the correlogram. Again, this does not mean that there is no trend, it just means that the correlogram does not reflect it because it is masked by the seasonal component.
White noise
Time series that show no autocorrelation are called white noise.
# Sets a seed to ensure repeatibility of random functions
set.seed(30)
y <- tsibble(sample = 1:50, wn = rnorm(50), index = sample)
y %>% autoplot(wn) + labs(title = "White noise", y = "")In particular, the above series samples from a gaussian distribution. In this case the noise is called gaussian white noise. The process for generating such a gaussian white noise is sampling from a normal distribution as follows:
Lets have a look at the correlogram:
y %>%
ACF(wn) %>%
autoplot() + labs(title = "White noise")For white noise series, autocorrelation is expected to be close to 0, but there is some random variation that will make it deviate from zero.
- We expect 95% of the spikes in the ACF to lie within \(\pm 2/\sqrt{T}\), where \(T\) is the length of the time series.
- Checking this is similar to performing separate statistical tests on each autocorrelation coefficients.
- Therefore, we may have false positives.
If:
- One or more large spikes are outside these bounds
- Substantially more than 5% of the spikes are outside these bounds
Then the time series is probably not white noise..
Later on in the course we will see how to perform statistical statistical that encompass multiple autocorrelation coefficients. For now the above criteria should suffice.
Exercise 1 - ACF Plot patterns
With the concepts we have seen in this class, you should be able to tell which time series corresponds to each correlogram:
3 - D - series exhibits a clear trend and D is the only correlogram that matches
this. We also observe spikes at lag 12. Since this is monthly data, this
corresponds to yearly seasonality. It matches the data structure.
4 - C - the series exhibits cycles of approximately 10 years. These cycles seem
pretty regular. The only correlogram showing this behavior is correlogram C.
Remember these are cycles and not seasonality. We are dealing with yearly data.
2 - A - correlogram exhibits yearly seasonality (spikes at lag 12 for monthly data),
matching the time series. No trend in the data.
1 - Signal from sensor. More erratic behavior and a bit of downwards trend.
No seasonality and a bit of trend. The only correlogram matching this is B.Exercise 2 - time series graphs
Use the following graph functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features of the time series Total Private from the dataset us_employment (loaded with fpp3):
us_employment# A tsibble: 143,412 x 4 [1M]
# Key: Series_ID [148]
Month Series_ID Title Employed
<mth> <chr> <chr> <dbl>
1 1939 Jan CEU0500000001 Total Private 25338
2 1939 Feb CEU0500000001 Total Private 25447
3 1939 Mar CEU0500000001 Total Private 25833
4 1939 Apr CEU0500000001 Total Private 25801
5 1939 May CEU0500000001 Total Private 26113
6 1939 Jun CEU0500000001 Total Private 26485
7 1939 Jul CEU0500000001 Total Private 26481
8 1939 Aug CEU0500000001 Total Private 26848
9 1939 Sep CEU0500000001 Total Private 27468
10 1939 Oct CEU0500000001 Total Private 27830
# ℹ 143,402 more rows
total_private <- us_employment %>%
filter(Title == "Total Private")
total_private# A tsibble: 969 x 4 [1M]
# Key: Series_ID [1]
Month Series_ID Title Employed
<mth> <chr> <chr> <dbl>
1 1939 Jan CEU0500000001 Total Private 25338
2 1939 Feb CEU0500000001 Total Private 25447
3 1939 Mar CEU0500000001 Total Private 25833
4 1939 Apr CEU0500000001 Total Private 25801
5 1939 May CEU0500000001 Total Private 26113
6 1939 Jun CEU0500000001 Total Private 26485
7 1939 Jul CEU0500000001 Total Private 26481
8 1939 Aug CEU0500000001 Total Private 26848
9 1939 Sep CEU0500000001 Total Private 27468
10 1939 Oct CEU0500000001 Total Private 27830
# ℹ 959 more rows
autoplot(total_private) +
scale_x_yearmonth(breaks = "5 years",
minor_breaks = "1 year") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))Plot variable not specified, automatically selected `.vars = Employed`
total_private %>% gg_season(Employed)gg_subseries(total_private, Employed)gg_lag(total_private, Employed, lags = 1:12)autoplot(ACF(total_private, Employed))In all of these plots, the trend is so dominant that it is hard to see anything
else.
In particular the lagplots and correlogram show very strong and positive correlation
between adjacent lags that decay very slowly. The effect of seasonality is much
smaller than the effect of the trend and cannot be perceived by visual inspection
of the correlogram.
We will need to remove the trend to explore other features of the data.
The next chapter (Time Series decomposition) will show how to identify the different
components of the time series.Exercise 3 - time series graphs
The PBS dataset included in fpp3 contains Monthly Medicare Australia prescription data. Run ?PBS on the console for further info.
We would like to focus on the prescriptions with ATC (Anatomical Therapeutic Chemical Index level 2) equal to H02. This group corresponds to corticosteroids for systemic use.
PBS %>%
filter(ATC2 == "H02") %>%
select(ATC2, Month, Cost, everything()) %>%
arrange(Month)# A tsibble: 816 x 9 [1M]
# Key: Concession, Type, ATC1, ATC2 [4]
ATC2 Month Cost Concession Type ATC1 ATC1_desc ATC2_desc Scripts
<chr> <mth> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 H02 1991 Jul 317384 Concessional Co-paym… H Systemic… CORTICOS… 63261
2 H02 1991 Jul 82929 Concessional Safety … H Systemic… CORTICOS… 12211
3 H02 1991 Jul 20178 General Co-paym… H Systemic… CORTICOS… 4013
4 H02 1991 Jul 9304 General Safety … H Systemic… CORTICOS… 2101
5 H02 1991 Aug 269891 Concessional Co-paym… H Systemic… CORTICOS… 53528
6 H02 1991 Aug 100602 Concessional Safety … H Systemic… CORTICOS… 14827
7 H02 1991 Aug 16519 General Co-paym… H Systemic… CORTICOS… 3397
8 H02 1991 Aug 13894 General Safety … H Systemic… CORTICOS… 2970
9 H02 1991 Sep 269703 Concessional Co-paym… H Systemic… CORTICOS… 52822
10 H02 1991 Sep 128900 Concessional Safety … H Systemic… CORTICOS… 18896
# ℹ 806 more rows
- As you can see, after filtering for
ATC2 == H02there are 4 entries per month corresponding to different combinations of the columnsConcessionandType. We would like to add the Cost associated to those 4 entries for every month, to obtain a time series of the total cost per month. Useindex_by()andsummarise()to get the total cost per month
h02_monthly <- PBS %>%
filter(ATC2 == "H02") %>%
index_by(Month) %>%
summarise(
Cost = sum(Cost)
)
h02_monthly# A tsibble: 204 x 2 [1M]
Month Cost
<mth> <dbl>
1 1991 Jul 429795
2 1991 Aug 400906
3 1991 Sep 432159
4 1991 Oct 492543
5 1991 Nov 502369
6 1991 Dec 602652
7 1992 Jan 660119
8 1992 Feb 336220
9 1992 Mar 351348
10 1992 Apr 379808
# ℹ 194 more rows
- Create and interpret the following graphs
- Time-plot with sufficient resolution to visually identify the seasonality
- Seasonal plot and seasonal subseries plots to identify the evolution of the seasonal component.
- ACF plots and lag plots to identify the seasoanlity
h02_monthly %>%
autoplot(Cost) +
# We use scale_x_yearmonth() because these are the units of the time index
scale_x_yearmonth(date_breaks = "1 year",
minor_breaks = "1 year") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))The graph revels yearly seasonality as well as an upwards trend. Further, the
variability in the seasonal pattern seems to be proportional to the level
of the trend.
There is also a noteworthy drop every February.h02_monthly %>%
gg_season(Cost)The graph reveals the underlying upwards trend of the series, as well as the drop occurring every February.h02_monthly %>%
gg_subseries(Cost)The trends have been greater in the higher peaking months like august, september, october, november, december...h02_monthly %>%
gg_lag(Cost, geom='point', lags=1:16)h02_monthly %>%
ACF(Cost) %>% autoplot()The strong yearly seasonality is clear in the spike of the ACF every 12 lags().
The lagplots show a separate cluster of points in almost every graph that matches
the large sales in January.Exercise 4 - time series graphs
The us_gasoline dataset, loaded with fpp3, contains weekly data beginning on Week 6 1991 and ending on Week 3 2017. Units are “million barrels per day”. This is the time series you need to analyse
head(us_gasoline)# A tsibble: 6 x 2 [1W]
Week Barrels
<week> <dbl>
1 1991 W06 6.62
2 1991 W07 6.43
3 1991 W08 6.58
4 1991 W09 7.22
5 1991 W10 6.88
6 1991 W11 6.95
Create and interpret the following graphs:
- Timeplot with sufficient resolution to spot the beginning of each year.
- Seasonal and seasonal subseries plots to examine a period of 1 year.
- Lagplots and correlogram to examine if there is any seasonality.
us_gasoline %>%
autoplot(Barrels) +
# We pick scale_x_yearweek() because this is the structure of the time index
# of the series.
scale_x_yearweek(breaks = "1 years",
minor_breaks = "1 year") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))Positive trend until 2008 -> Global financial crisis.
Production seems to take positive trend again after 2012.
Shape of seasonality seems to have changed over time.us_gasoline %>%
gg_season(Barrels)The noise makes it difficult to spot the seasonal pattern.us_gasoline %>%
gg_subseries(Barrels, period = "year")The bluelines mark the average levels in each week of the year.
On average, production seems to increase until Q3 to then decrease.us_gasoline %>%
gg_lag(Barrels, geom='point')us_gasoline %>%
ACF(Barrels, lag_max = 160) %>% autoplot()In this case, the lagplots are not all that useful because of the big amount
of weeks per year. In order to see the seasonal pattern on the correlogram,
we need to increase the number of lags at least beyond 52
(number of weeks in a year).
If we do this, we clearly observe yearly seasonality.